In this document, I am hoping to mostly copy/paste material from the tests/ tree and explain the various functionalities therein. It is my hope therefore to step from data loading all the way through ontology searching with appropriate visualizations at each stage.
In test_01load_data.R I perform load some data into an expressionset and get ready to play with it.
## I use sm to keep functions from printing too much (well, anything really)
tt <- sm(library(hpgltools))
tt <- sm(library(pasilla))
tt <- sm(data(pasillaGenes))
biomart is an excellent resource for annotation data, but it is entirely too complex. The following function ‘get_biomart_annotations()’ attempts to make that relatively simple.
## Try loading some annotation information for this species.
gene_info_lst <- sm(load_biomart_annotations(species="dmelanogaster",
host="useast.ensembl.org"))
gene_info <- gene_info_lst[["annotation"]]
info_idx <- gene_info[["gene_biotype"]] == "protein_coding"
gene_info <- gene_info[info_idx, ]
rownames(gene_info) <- make.names(gene_info[["ensembl_gene_id"]], unique=TRUE)
head(gene_info)
## ensembl_transcript_id ensembl_gene_id description gene_biotype
## FBgn0260439 FBtr0005088 FBgn0260439 NA protein_coding
## FBgn0000056 FBtr0006151 FBgn0000056 NA protein_coding
## FBgn0031081 FBtr0070000 FBgn0031081 NA protein_coding
## FBgn0031085 FBtr0070002 FBgn0031085 NA protein_coding
## FBgn0062565 FBtr0070003 FBgn0062565 NA protein_coding
## FBgn0031089 FBtr0070006 FBgn0031089 NA protein_coding
## cds_length chromosome_name strand start_position end_position
## FBgn0260439 1776 2L + 8366038 8370090
## FBgn0000056 819 2L + 14615552 14618902
## FBgn0031081 2361 X + 19961297 19969323
## FBgn0031085 633 X + 20051294 20052519
## FBgn0062565 1164 X + 20094398 20095767
## FBgn0031089 1326 X + 20148124 20155514
The pasilla data set provides count tables in a tab separated file, let us read them into an expressionset in the following block along with creating an experimental design. create_expt() will then merge the annotations, experimental design, and count tables into an expressionset.
## This section is copy/pasted to all of these tests, that is dumb.
datafile <- system.file("extdata/pasilla_gene_counts.tsv", package="pasilla")
## Load the counts and drop super-low counts genes
counts <- read.table(datafile, header=TRUE, row.names=1)
counts <- counts[rowSums(counts) > ncol(counts),]
## Set up a quick design to be used by cbcbSEQ and hpgltools
design <- data.frame(row.names=colnames(counts),
condition=c("untreated","untreated","untreated",
"untreated","treated","treated","treated"),
libType=c("single_end","single_end","paired_end",
"paired_end","single_end","paired_end","paired_end"))
metadata <- design
colnames(metadata) <- c("condition", "batch")
metadata[["sampleid"]] <- rownames(metadata)
## Make sure it is still possible to create an expt
pasilla_expt <- sm(create_expt(count_dataframe=counts, metadata=metadata,
savefile="pasilla", gene_info=gene_info))
The rest of test_01load_data.R checks the various slots of the resulting expt to ensure that important stuff for future analyses are available, primarily: condition/batch, library sizes, annotations, counts.
The next set of tests seek to ensure that the various plots used to visualize and understand trends in the data are maintained over time.
In this first block I will use a single function graph_metrics() to plot them all. And then follow up with the one at a time. Many functions in hpgltools are quite chatty with liberal usage of message(), as a result I will sm() this call to shut it up.
pasilla_metrics <- sm(graph_metrics(pasilla_expt, ma=TRUE, qq=TRUE))
summary(pasilla_metrics)
## Length Class Mode
## boxplot 9 gg list
## corheat 3 recordedplot list
## cvplot 9 gg list
## density 10 gg list
## density_table 5 data.table list
## disheat 3 recordedplot list
## gene_heatmap 0 -none- NULL
## legend 3 recordedplot list
## legend_colors 3 data.frame list
## libsize 9 gg list
## libsizes 4 data.table list
## libsize_summary 7 data.table list
## ma 21 -none- list
## nonzero 9 gg list
## nonzero_table 7 data.frame list
## pc_loadplot 3 recordedplot list
## pc_summary 4 data.frame list
## pc_propvar 6 -none- numeric
## pc_plot 9 gg list
## pc_table 14 data.frame list
## qqlog 3 recordedplot list
## qqrat 3 recordedplot list
## smc 9 gg list
## smd 9 gg list
## topnplot 9 gg list
## tsne_summary 4 data.frame list
## tsne_propvar 20 -none- numeric
## tsne_plot 9 gg list
## tsne_table 10 data.frame list
Now let us print the graphs
pasilla_metrics$libsize
## The library sizes range from 8-21 million reads, this might be a problem for
## some analyses.
pasilla_metrics$nonzero
## Ergo, the lower abundance libraries have more genes of counts == 0 (bottom
## left).
pasilla_metrics$boxplot
## And a boxplot downshifts them (but not that much because it decided to put
## the data on the log scale).
pasilla_metrics$density
## Similarly, one can see those samples are a bit lower with respect to density
## Unless the data is very well behaved, the rest of the plots are not likely to
## look good until the data is normalized, nonetheless, lets see
pasilla_metrics$corheat
pasilla_metrics$disheat
pasilla_metrics$pc_plot
## So the above 3 plots are pretty much the worst case scenario for this data.
The most common normalization suggested by Najib is a cpm(quantile(filter(data))). On top of that we often do log2() and/or a batch adjustment. default_norm() quietly does the first and may be supplemented with other arguments.
norm <- default_norm(pasilla_expt, transform="log2")
norm_metrics <- graph_metrics(norm)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Graphing a T-SNE plot.
## Plotting a density plot.
## Plotting a CV plot.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## Plotting the representation of the top-n genes.
## Plotting the expression of the top-n PC loaded genes.
## Printing a color to condition legend.
norm_metrics$corheat
norm_metrics$smc
norm_metrics$disheat
norm_metrics$smd
## some samples look a little troublesome here.
norm_metrics$pc_plot
pasilla_pairwise <- sm(all_pairwise(pasilla_expt))
pasilla_tables <- sm(combine_de_tables(
pasilla_pairwise,
excel="pasilla_tables.xlsx"))
pasilla_sig <- sm(extract_significant_genes(
pasilla_tables,
excel="pasilla_sig.xlsx"))
pasilla_ab <- sm(extract_abundant_genes(
pasilla_pairwise,
excel="pasilla_abundant.xlsx"))
pasilla_tables$plots[[1]][["deseq_ma_plots"]]$plot
pasilla_tables$plots[[1]][["edger_ma_plots"]]$plot
pasilla_tables$plots[[1]][["limma_ma_plots"]]$plot
up_genes <- pasilla_sig[["deseq"]][["ups"]][[1]]
down_genes <- pasilla_sig[["deseq"]][["downs"]][[1]]
pasilla_go <- load_biomart_go(species="dmelanogaster")$go
## Cache found
## Finished downloading ensembl go annotations, saving to dmelanogaster_go_annotations.rda.
## Saving ontologies to dmelanogaster_go_annotations.rda.
## Finished save().
pasilla_length <- fData(pasilla_expt)[, c("ensembl_gene_id", "cds_length")]
colnames(pasilla_length) <- c("ID", "length")
pasilla_goseq <- simple_goseq(sig_genes=up_genes, go_db=pasilla_go, length_db=pasilla_length)
## Using the row names of your table.
## Found 104 genes out of 113 from the sig_genes in the go_db.
## Found 102 genes out of 113 from the sig_genes in the length_db.
## Using manually entered categories.
## Calculating the p-values...
## 'select()' returned 1:1 mapping between keys and columns
## simple_goseq(): Calculating q-values
## simple_goseq(): Filling godata with terms, this is slow.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## simple_goseq(): Making pvalue plots for the ontologies.
pasilla_goseq$pvalue_plots$bpp_plot_over
pasilla_goseq <- simple_goseq(sig_genes=down_genes, go_db=pasilla_go, length_db=pasilla_length)
## Using the row names of your table.
## Found 99 genes out of 109 from the sig_genes in the go_db.
## Found 97 genes out of 109 from the sig_genes in the length_db.
## Using manually entered categories.
## Calculating the p-values...
## 'select()' returned 1:1 mapping between keys and columns
## simple_goseq(): Calculating q-values
## simple_goseq(): Filling godata with terms, this is slow.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## simple_goseq(): Making pvalue plots for the ontologies.
pasilla_goseq$pvalue_plots$bpp_plot_over
test <- simple_goseq(sig_genes=names(pasilla_ab$abundances$deseq$treated), go_db=pasilla_go, length_db=pasilla_length)
## Found 188 genes out of 200 from the sig_genes in the go_db.
## Found 181 genes out of 200 from the sig_genes in the length_db.
## Using manually entered categories.
## Calculating the p-values...
## 'select()' returned 1:1 mapping between keys and columns
## simple_goseq(): Calculating q-values
## simple_goseq(): Filling godata with terms, this is slow.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
## simple_goseq(): Making pvalue plots for the ontologies.
test$pvalue_plots$bpp_plot_over
pander::pander(sessionInfo())
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: stats4, parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: pasilla(v.1.14.0), GO.db(v.3.10.0), AnnotationDbi(v.1.48.0), GOstats(v.2.52.0), edgeR(v.3.28.0), variancePartition(v.1.16.1), fission(v.1.6.0), ruv(v.0.9.7.1), SummarizedExperiment(v.1.16.1), DelayedArray(v.0.12.2), BiocParallel(v.1.20.1), matrixStats(v.0.55.0), GenomicRanges(v.1.38.0), GenomeInfoDb(v.1.22.0), IRanges(v.2.20.2), S4Vectors(v.0.24.3), hpgltools(v.1.0), Biobase(v.2.46.0) and BiocGenerics(v.0.32.0)
loaded via a namespace (and not attached): rappdirs(v.0.3.1), SparseM(v.1.78), rtracklayer(v.1.46.0), AnnotationForge(v.1.28.0), R.methodsS3(v.1.8.0), tidyr(v.1.0.2), ggplot2(v.3.2.1), acepack(v.1.4.1), bit64(v.0.9-7), knitr(v.1.28), R.utils(v.2.9.2), data.table(v.1.12.8), rpart(v.4.1-15), RCurl(v.1.98-1.1), doParallel(v.1.0.15), GenomicFeatures(v.1.38.2), preprocessCore(v.1.48.0), callr(v.3.4.2), usethis(v.1.5.1), RSQLite(v.2.2.0), bit(v.1.1-15.2), xml2(v.1.2.2), httpuv(v.1.5.2), assertthat(v.0.2.1), xfun(v.0.12), hms(v.0.5.3), evaluate(v.0.14), promises(v.1.1.0), DEoptimR(v.1.0-8), fansi(v.0.4.1), progress(v.1.2.2), caTools(v.1.18.0), dbplyr(v.1.4.2), Rgraphviz(v.2.30.0), DBI(v.1.1.0), geneplotter(v.1.64.0), htmlwidgets(v.1.5.1), purrr(v.0.3.3), ellipsis(v.0.3.0), crosstalk(v.1.0.0), selectr(v.0.4-2), dplyr(v.0.8.4), backports(v.1.1.5), annotate(v.1.64.0), biomaRt(v.2.42.0), blockmodeling(v.0.3.6), vctrs(v.0.2.3), remotes(v.2.1.1), withr(v.2.1.2), robustbase(v.0.93-5), checkmate(v.2.0.0), GenomicAlignments(v.1.22.1), prettyunits(v.1.1.1), cluster(v.2.1.0), lazyeval(v.0.2.2), crayon(v.1.3.4), genefilter(v.1.68.0), pkgconfig(v.2.0.3), labeling(v.0.3), nlme(v.3.1-144), pkgload(v.1.0.2), nnet(v.7.3-12), devtools(v.2.2.2), rlang(v.0.4.4), lifecycle(v.0.1.0), affyio(v.1.56.0), BiocFileCache(v.1.10.2), directlabels(v.2020.1.31), rprojroot(v.1.3-2), graph(v.1.64.0), Matrix(v.1.2-18), boot(v.1.3-24), base64enc(v.0.1-3), geneLenDataBase(v.1.22.0), processx(v.3.4.2), png(v.0.1-7), viridisLite(v.0.3.0), bitops(v.1.0-6), R.oo(v.1.23.0), KernSmooth(v.2.23-16), pander(v.0.6.3), Biostrings(v.2.54.0), EBSeq(v.1.26.0), blob(v.1.2.1), stringr(v.1.4.0), qvalue(v.2.18.0), readr(v.1.3.1), jpeg(v.0.1-8.1), scales(v.1.1.0), memoise(v.1.1.0), GSEABase(v.1.48.0), magrittr(v.1.5), plyr(v.1.8.5), gplots(v.3.0.1.2), gdata(v.2.18.0), zlibbioc(v.1.32.0), compiler(v.3.6.1), RColorBrewer(v.1.1-2), lme4(v.1.1-21), DESeq2(v.1.26.0), Rsamtools(v.2.2.2), cli(v.2.0.1), ade4(v.1.7-15), affy(v.1.64.0), XVector(v.0.26.0), Category(v.2.52.1), ps(v.1.3.2), htmlTable(v.1.13.3), Formula(v.1.2-3), MASS(v.7.3-51.5), mgcv(v.1.8-31), tidyselect(v.1.0.0), stringi(v.1.4.6), highr(v.0.8), yaml(v.2.2.1), askpass(v.1.1), locfit(v.1.5-9.1), latticeExtra(v.0.6-29), ggrepel(v.0.8.1), grid(v.3.6.1), tools(v.3.6.1), rstudioapi(v.0.11), foreach(v.1.4.8), foreign(v.0.8-75), gridExtra(v.2.3), farver(v.2.0.3), Rtsne(v.0.15), digest(v.0.6.24), BiocManager(v.1.30.10), shiny(v.1.4.0), quadprog(v.1.5-8), Rcpp(v.1.0.3), later(v.1.0.0), httr(v.1.4.1), genoPlotR(v.0.8.9), colorspace(v.1.4-1), rvest(v.0.3.5), XML(v.3.99-0.3), fs(v.1.3.1), topGO(v.2.38.1), splines(v.3.6.1), RBGL(v.1.62.1), plotly(v.4.9.2), sessioninfo(v.1.1.1), xtable(v.1.8-4), jsonlite(v.1.6.1), nloptr(v.1.2.1), corpcor(v.1.6.9), testthat(v.2.3.1), R6(v.2.4.1), Vennerable(v.3.1.0.9000), Hmisc(v.4.3-1), pillar(v.1.4.3), htmltools(v.0.4.0), mime(v.0.9), glue(v.1.3.1), fastmap(v.1.0.1), minqa(v.1.2.4), DESeq(v.1.38.0), codetools(v.0.2-16), pkgbuild(v.1.0.6), lattice(v.0.20-40), tibble(v.2.1.3), sva(v.3.34.0), pbkrtest(v.0.4-7), curl(v.4.3), BiasedUrn(v.1.07), colorRamps(v.2.3), gtools(v.3.8.1), zip(v.2.0.4), openxlsx(v.4.1.4), openssl(v.1.4.1), survival(v.3.1-8), limma(v.3.42.2), rmarkdown(v.2.1), desc(v.1.2.0), munsell(v.0.5.0), fastcluster(v.1.1.25), GenomeInfoDbData(v.1.2.2), iterators(v.1.0.12), goseq(v.1.38.0), reshape2(v.1.4.3) and gtable(v.0.3.0)
message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset b2281875fa9d39100f351eaf431efc6eecbfbe05
## This is hpgltools commit: Thu Mar 5 14:48:45 2020 -0500: b2281875fa9d39100f351eaf431efc6eecbfbe05